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ABSTRACT Interlocus gene conversion is a major evolutionary force that drives the concerted evolution of 
duplicated genomic regions. Theoretical models successfully have addressed the effects of interlocus gene 
conversion and the importance of crossover in the evolutionary fate of gene families and duplications but 
have not considered complex recombination scenarios, such as the presence of hotspots. To study the 
interplay between interlocus gene conversion and crossover, we have developed a forward-time simulator 
that allows the exploration of a wide range of interlocus gene conversion rates under different crossover 
models. Using it, we have analyzed patterns of nucleotide variation and linkage disequilibrium within and 
between duplicate regions, focusing on a neutral scenario with constant population size and validating our 
results with the existing theoretical models. We show that the interaction of gene conversion and crossover 
is nontrivial and that the location of crossover junctions is a fundamental determinant of levels of variation 
and linkage disequilibrium in duplicated regions. We also show that if crossover activity between 
duplications is strong enough, recurrent interlocus gene conversion events can break linkage disequilibrium 
within duplicates. Given the complex nature of interlocus gene conversion and crossover, we provide 
a framework to explore their interplay to help increase knowledge on molecular evolution within segmental 
duplications under more complex scenarios, such as demographic changes or natural selection. 
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Gene duplication has been recognized as a primary source of genetic 
innovation since Ohno's seminal work on this topic in 1970 (Ohno 
1970). The accumulation of genomic sequence data from a wide range 
of species has shown that segmental duplications (SDs) spanning 
more than 1 kb and retaining a high degree of sequence homogeneity 
(>90%) are a pervasive feature of eukaryotic genomes (Lynch and 
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Conery 2000; Bailey et al. 2002; Marques-Bonet et al. 2009). Under- 
standing the molecular evolution of duplicated nucleotide sequences 
is also of great relevance because SDs define hotspots of chromosomal 
rearrangements and are known to give rise to copy-number variants 
(Sharp et al. 2005; Mills et al. 2011; Uddin et al. 2011), which in turn 
frequently are implicated in disease susceptibility (Conrad et al. 2010; 
Stankiewicz and Lupski 2010) and are targets of natural selection 
(Gazave et al. 2011; Iskow et al. 2012; Lorente-Galdos et al. 2013). 

During the 1980s, extensive theoretical work was carried out on 
the subject of multigene family evolution (Baltimore 1981; Dover 
1982; Ohta 1982, 1983; Nagylaki and Petes 1982; Nagylaki 1984). 
More recendy, Innan and collaborators (Innan 2002, 2003; Teshima 
and Innan 2004; Mano and Innan 2008) and Thornton (2007) have 
successfully combined analytical results and coalescent simulations to 
study diversity patterns and divergence times between duplicates and 
have analyzed various selection scenarios. StiU, many unique features 
of SDs have not yet been fuUy addressed, neither analytically or by 
simulation. 
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The foremost feature of duplicated genomic regions is that they 
undergo concerted evolution due to the exchange of genetic in- 
formation via interlocus gene conversion (IGC) (Nagylaki and Petes 
1982; Ohta 1982). IGC can be described as a copy-paste event (Innan 
2009) whereby a fragment of one of the duplicated segments is copied 
onto the corresponding segment of the duplicate (Wiuf and Hein 
2000). IGC is also referred to as nonallelic, interparalog, or ectopic 
gene conversion. IGC is known to be a major player in small (two- 
copy) multigene family evolution (Ohta 1983), and empirical esti- 
mates of IGC rates span several orders of magnitude (Chen et al. 
2007; Benovoy and Drouin 2009; McGrath et al. 2009). The extent 
to which the theory from multigene family evolution can be appUed 
to the evolution of SDs and copy-number variants remains an open 
question. In particular, because SDs can span large regions, some ge- 
nomic features, such as the pervasiveness of crossover hotspots in the 
human genome (Jeffreys et al. 2001; Kong et al. 2002), might have non- 
negligible effects on the levels of variation within and between SDs. The 
evolution of SDs under different crossover models, including the over- 
lap of IGC susceptible regions with crossover hotspots, or the linkage 
disequilibrium (LD) patterns within and between SDs, are some of the 
issues that, to our knowledge, have not been addressed before. 

To tackle some of these questions, a flexible forward-time simulator 
should prove to be a powerful tool. Although coalescent simulators have 
the advantage of being extremely fast, forward-time simulators allow 
the exploration of a wider parameter space and the implementation of 
different scenarios (such as sophisticated crossover models) in a straight- 
forward way. We here introduce a forward-time algorithm in C++ and 
explore a wide range of realistic parameter values in a model of neutral 
evolution of duplicated chromosomal regions undergoing mutation, 
IGC, and crossover under different models. 

MATERIALS AND METHODS 

We model a Wright-Fisher population of N diploid individuals, where 
each of the 2N chromosomes consists of either two or three blocks. 
Each block is L nucleotides long and is characterized by an ordered set 
of integers (representing nucleotide positions) between 0 and L — 1. 

Simulation procedure: phases 

Simulations have three phases, the bum-in phase, the structured phase, 
and the concerted evolution phase. Each simulation begins with 
a burn-in phase of Tj generations in which the population undergoes 
random mating, mutation, and recombination by crossover. During 
this first phase, each chromosome is composed of two blocks that we 
will refer to as original and single-copy blocks. We run this first phase 
long enough for the population to reach equilibrium. At t = Tj + 1, the 
structured phase starts with a duplication event in which the original 
block of a randomly chosen chromosome is duplicated and "pasted" 
to the right of its single-copy block (Figure 1). We shall refer to this 
third block as duplicated block. This unique duplication is fixed by 
drift. To achieve fixation in a computationally efficient way, a neutral 
fixation trajectory is simulated using Kimura's pseudo-sampling method 
(Kimura 1980). During the structured phase, of average T2 = 4N gen- 
erations (and maximum T2 = 20N generations), the number of chromo- 
somes carrying the duplication in each generation is determined by the 
fixation trajectory. Once the fixation of the duplication has occurred, the 
program proceeds to a concerted evolution phase (of = T-j- - Ti - T2 
generations) tmtil t = T-r- 

Mutation 

Mutation is active during all phases and occurs independently in each 
block at a rate [jl per site per generation. We use a pseudo infinite-sites 



model (Padhukasahasram et al. 2008) in which the number of sites in 
each block is finite (L), but mutations can only appear in nonpoly- 
morphic positions. For mutational purposes, a variant that occurs in 
either the original or the duphcated block is still considered as poly- 
morphic even if it is segregating only in one of them, and therefore 
any mutation is only considered as fixed if it is present in all chro- 
mosomes on both blocks. Multiple mutations are not allowed in the 
same block and generation. We have verified that allowing multiple 
mutations does not change the results presented in this work. 

Crossover 

Homologous crossover occurs at rate r per meiosis, with a maximimi 
of one crossover allowed per meiosis. Most previous studies of multigene 
family evolution (Ohta 1983; Nagylaki 1984; Innan 2002; Thornton 
2007) consider intergenic crossover only. In other words, previous mod- 
els are such that meiotic crossover junctions are only allowed be- 
tween duplicated blocks (usually called "genes" in these papers) but not 
within them. We will refer to this model as the single-copy crossover 
(SCC) model since all crossover junctions fall within the single-copy 
block. Here, we explore two additional crossover models: the whole- 
region crossover (WRC) model, that allows crossover junctions from 5' 
end of one duplicate to 3' end of the other duplicate, which so far has 
only been used by Teshima and Innan (2012); and the hotspot crossover 
(HSC) model, in which a particular region of the simulated segment is 
predefined as a crossover hotspot such that all crossover junctions fall 
within this region. Choosing an adequate definition of r is not straight- 
forward, given that chromosomes with different lengths (two or three 
blocks) segregate at changing frequencies in the population, and that 
different models have different a priori distributions of crossover events. 
To facilitate comparison among models, we define r per meiosis, irre- 
spective of the lengths of the chromosomes involved. In other words, 
the same average number of crossovers occurs per meiosis per gener- 
ation under any model and phase. This implies, for example, that under 
the WRC model, the effective crossover rate per base pair is higher 
during the burn-in phase (two blocks) than during the concerted evo- 
lution phase (three blocks). The effects of this choice are always small, 
do not affect measures taken at equilibrium, and are easier to account 
for than those that would be introduced by other potential definitions of 
r. We will refer to the population scaled crossover rate as R = 4Nr. At 
each crossover event, a crossover junction is randomly selected from the 
region allowed by the underlying crossover model. 

Interlocus gene conversion 

At every generation, if a chromosome contains the duplication, there 
is a per site probability of an IGC event being initiated, g. An IGC 
event can be described as a copy-paste event (Innan 2009) in which 
the information {i.e., all derived mutations) contained along an IGC 
tract on one of the duplicate blocks is copied and pasted onto the 
other duplicate substituting any variants that were previously within 
the paralogous tract in that dupUcate. In our model, an IGC event can 
be initiated at any position of any duplicate block according to a uni- 
form distribution. The length of IGC tracts is determined according to 
a geometric distribution with average size A, which fits available em- 
pirical evidence (Wiuf and Hein 2000). Following Wiuf and Hein 
(2000), the tract length for every IGC event, /, is extracted from a geo- 
metric distribution with parameter q = A/L; 

P,{l) = q{l-q)'-K (1) 

Instead of randomly choosing the direction in which the IGC tract 
extends (either 3' or 5'), we determine the IGC tract extending 1/2 
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Figure 1 The three phases of every simulation run. Each simulation begins with a burn-in phase, in which a population formed by chromosomes 
with two single-copy blocks (dark blue and orange) is brought to mutation-drift equilibrium. The duplication of the first of these blocks (original 
block: dark blue, duplicated block: light blue) marks the initiation of the structured phase, during which the duplication becomes fixed. Finally, 
during the concerted evolution phase, the population reaches a new equilibrium in which the interplay of interlocus gene conversion between 
duplicated blocks and crossover determines levels and patterns of variability. 



sites to the left and 112 to the right of the initiation point or junction. 
Mansai et al. (2011) proposed a similar model in which there is an 
independent exponential elongation of the gene conversion tract in 
both directions from the initiation point. We deviate from previous 
models (Wiuf and Hein 2000; Thornton 2007; Mansai et al. 2011) in 
limiting IGC tracts to the duplicate regions by the simple procedure 
of truncating any tract that extends beyond the duplicate blocks. We 
expect that this simplification has a negligible impact on our results. 
In our model, IGC only occurs between paralogous regions on the 
same chromosome and both copies act as donor or receptor with 
equal probability. The mutation positions found in the donor tract 
are imported to the receptor, and the mutations in the receptor tract 
are erased. Finally, we allow IGC to occur regardless of the diver- 
gence between blocks on the same chromosome (Supporting Infor- 
mation, File SI). Other authors, notably Teshima and Innan (2004), 
implemented a divergence threshold above which IGC is terminated. 
The IGC rate per site per generation will be c = ^A, where A is the 
effective average length of the IGC tracts (although not accounting 
for possible truncations) and the population IGC rate will be 
C = 4Nc. 

Eras: simulating neutral genealogies 

The neutral scenario allows us to decrease the running time of 
simulations by avoiding the explicit simulation of chromosomes that 
will eventually leave no offspring in the population (Padhukasahasram 
et al. 2008). This requires simulating a genealogy for k generations in 
advance and then tracing it back to ascertain which chromosomes, at 
each generation, will make a contribution to the final population. For 
the sake of clarity, we will refer to each period of k generations as an 
Era. Because there is a trade-off between the time it takes to simulate 
the genealogy and the time saved by not simulating those chromo- 
somes destined to be lost, we need to choose an optimal value of k. 
This value varies considerably with population size, crossover rate, 
and other parameters, so we selected k by performing trial runs for 
every combination of parameters (not shown). 

Eras start at generation t' = 0. To construct the genealogy for each 
Era, each chromosome at each of the k generations is randomly 
assigned a parental chromosome from the previous generation. When 
R = 0, the process is straightforward: first, all chromosomes at gener- 
ation t' =k — 1, are tagged us fertile; next, the parental chromosome of 



every fertile individual nt t' = k — 1 is tagged as fertile at t' = k — 2; 
and the same is repeated for every generation until t' = 0. The case 
with crossover (R > 0) is essentially the same except that when 
tagging the parental chromosomes as fertile, there is a probability, r, 
that the partner of that chromosome (i.e., the other chromosome from 
the same individual) is also tagged as fertile. 

During the structured phase, the number of chromosomes 
carrying the duplication {s(t')) is determined by a neutral fixation 
trajectory. As mentioned previously, the structured phase begins with 
a unique duplication event in a randomly chosen chromosome {i.e. 
s{t'=0) = 1). At t' = 1, s(l) randomly chosen chromosomes will be 
assigned the chromosome carrying the duplication as their parental 
chromosome. The rest of the 2JV - s(l) chromosomes will be ran- 
domly assigned a parental chromosome not carrying the duplication 
(at this point, any other chromosome). At f = 2, s(2) randomly chosen 
chromosomes wiU be assigned any of the s(l ) chromosomes carrying 
the duplication at f' = i as their parental chromosome and the rest 
of the 2N - s(2) chromosomes will be randomly assigned a parental 
chromosome not carrying the duplication at f' = i. This process is 
continued until t' = k - 1. Typically, because the average neutral 
fixation time is 4N generations, fixation trajectories will be larger 
than k. Thus, the aforementioned process will frequently exceed 
k generations. 

Before the duplication reaches fixation, crossover might occur 
between chromosomes with a different number of blocks. In this case, 
irrespective of the underlying crossover model, the position of the 
crossover junction will be chosen from their shared length. Without 
loss of generality, the daughter chromosome wiU have the same num- 
ber of blocks as its parental chromosome, regardless of the number of 
blocks of its parental chromosome's partner. To ensure this, the 
daughter chromosome wiU inherit the region from the 5' end until 
the crossover junction (including the junction) from its parental chro- 
mosome's partner and the region from the junction until the 3' end 
from its parental chromosome. At the end of every Era, the mutations 
present on every block in a random sample of « = 50 individuals {i.e., 
100 chromosomes) are recorded. Fixed derived mutations are erased 
from simulated chromosomes after being recorded (mutations on 
duplicated blocks are only erased if they are fixed in both blocks), 
which saves memory and running time. At the end of each simulation 
run, we will have recorded data every k generations. 



r-^G3*Genes | Genomes | Genetics 



Volume 4 August 2014 I Variation and LD in Duplications I 1481 



Variation measures 

From the data recorded every Era, we extract site-frequency spectra 
and variation measurements such as the number of segregating 
sites (S) and the average number of pairwise differences (it). Although 
historically it makes sense to talk about sequence diversity and di- 
vergence when referring to duplicated regions of the genome, from 
here on and for the sake of simplicity, we wiU use the term variation to 
refer to both because we wiU measure both diversity and divergence by 
calculating tt within and between copies. To compare the evolution of 
the duphcate blocks under different parameter values, we explore 
nucleotide variation with the following measures (Figure 2): variation 
within a duphcate block (tt,,) (at equilibrium, variation within the 
original block and within the duplicated block wiU be the same); 
variation between the original and duplicated blocks on different 
chromosomes (iTb); and variation between the original and duplicated 
blocks on the same chromosome (ttJ. In addition, we also calculate 
the variation present within the single-copy block, which will serve 
as control. 

Average values from 10,000 simulations of tt within the original, 
the single- copy, and the duplicated blocks are represented in 
Figure 3A. During the burn-in phase, variation within the original 
and single-copy blocks reaches its neutral expectation tt^ = 0. Once 
the duplication appears (at t = Tj + 1 = 30N), variation begins to 
increase in the duplicated block, and IGC activity begins between the 
original and the duplicated blocks. Distributions of the variation 
within the original and duplicated blocks at different time points 
are shown in Figure 3B. 

LD measures 

To analyze how LD patterns vary with IGC rates and crossover 
models, we have devised a simple way of extracting LD measurements 
from independent simulation runs and of obtaining average values 
from them. To do so, we have binned every block in 100-bp windows. 
For every run, we analyze a sample of « = 50 individuals aX t = Tj-, 
once the population has reached equilibrium. For every pair of win- 
dows X and y, we calculate D' and between all possible pairs of 
mutations (one taken from x and the other from y). We then calculate 
the average value of aU these measurements and thus obtain a value of 
D' and for every pair of windows that is independent of the number 
of mutations that fall within a particular window in a particular run. 
We repeat this process for every simulation run and calculate the 
average values for every pair of windows. During the analysis of LD 



S 




Figure 2 Variation measures. We measure nucleotide variation among 
segmental duplications as follows: (w) variation within duplicate blocks 
(at equilibrium, variation within the original and duplicated blocks will 
be the same); (b) variation between the original and duplicated blocks 
on different chromosomes; (s) variation between the original and du- 
plicated blocks on the same chromosome. We use average pairwise 
differences (tt) to measure all these types of variation (tt„, Hb, tts). 



patterns, we will differentiate between LD within the duplicate regions 
and LD between paralogous windows of the original and dupUcated 
blocks (from here on, we refer to the latter as LD between duplicates). 

Parameter values 

In this paper, we concentrate on the effect of two parameters: the IGC 
rate (C), and the homologous crossover rate (R). Other parameters, 
such as the point mutation rate, only modify the scale and variance of 
our results. To allow for accurate comparisons with results from pre- 
vious works, aU our parameters are population scaled. Population size 
is an important parameter in so far as it does have an effect in 
populations undergoing natural selection. However, because we are 
exclusively considering a neutral scenario, we have fixed our popula- 
tion effective size to N = 1000. The range of parameters we have 
explored is based primarily on previous coalescent simulations (Innan 
2003; Thornton 2007) and a literature survey selecting a range of 
empirical estimates from various species to guarantee that we are 
exploring realistic scenarios. In Table 1, the range of parameters we 
explore here is compared to those in previous works. Details about 
how they were selected can be found in File SI. In addition. File S2 
shows how rates of IGC vary according to the distance between du- 
plicated blocks. In this paper, we exclusively explore duplicate blocks 
of length L separated by a block of equal length. The Perl and R scripts 
to construct the LD plots, as well as the C++ forward simulator used 
throughout this study, are available upon request. 

RESULTS 

The action of gene conversion between the original and duplicated 
blocks increases the amount of variation found within each of them. 
Variation within blocks wiU attain an equilibrium value between 0 
and 20, which wiU depend on IGC and crossover rates. Figure 3 
shows the case for C = 0.5, R = 50, and JV = 1000, for which tt.„ ~ 
1.950. IGC increases variation by transferring mutations that appear 
in one block onto the other block while at the same time preventing 
the fixation of different derived mutations in each block However, 
a clear increment in variation is observed only for a limited range of 
IGC rates. On the one hand, very high IGC rates can homogenize 
copies to such a point that they remain nearly identical and maintain 
a level of variation similar to that of single- copy regions. On the other 
hand, very low IGC rates can increase variation up to 20, but waiting 
times to reach this equilibrium can be extremely high (Nagylaki 1984). 

Increased variation within duplicates caused by IGC activity has 
already been extensively studied and modeled for small and large 
multigene families. We have selected three different models (Ohta 
1983; Innan 2002, 2003) to validate the results of our simulations 
on variation (FUe S3, Figure SI, and Figure S2) and on LD (File S4 
and Figure S3). Given some disparity in nomenclature among this 
body of work, we have introduced a new notation differentiating 
models, which is explained in Table 2. After validation, we proceeded 
to examine in detail the interplay between IGC and crossover in de- 
termining levels and patterns of variation and LD in duplications. In 
order to do so, we analyzed three different crossover models: SCC, in 
which crossover is limited to the single-copy region; WRC, in which 
crossover can occur in the whole simulated region; and HSC, in which 
crossover happens at a previously defined hotspot or hotspots. 

SCC model: single-copy crossover 

Under this model, crossover occurs only in the single-copy region 
between duplicates. The levels of variation found for the SCC model 
fall directly on the theoretical predictions provided by Ohta (1983) 
and Innan (2002, 2003) with a very high degree of accuracy (for 
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Figure 3 Changes in variation within 
blocks along simulations. Average 
results from 10,000 simulation runs 
are shown. (A) Dark blue, orange, 
and light blue curves correspond to 
the average pairwise differences 
found within the original, single- 
copy, and duplicated blocks, re- 
spectively. Gray-shaded areas 
correspond to the burn-in phase, 
structured phase, and concerted 
evolution phase. Duplication occurs 
at t = SON. Although we depict 
the structured phase as ending at 
t = SON, this is actually an arbitrary 
upper limit, because the neutral 
trajectory of the duplicated chro- 
mosomes and length of the struc- 
tured phase is different every 
simulation. As expected, variation 
at equilibrium for the single-copy 

block is 0 = 9/- = 4N]i,L. The original and duplicated blocks attain higher variation (-1.950) due to IGC activity among them. Parameters for 
this simulation are N = 1000, k = 1000, L = 5000, 6 = 0.001, C = 0.5, R = 50, and \ = 100. (B) Distributions of tt„=''" for the original (top) and 
duplicated (bottom) blocks at different times after the appearance of the duplication (t = 31 N, 32N, 34N, 38N, 50N, and 120N) are colored in 
different shades of green. 
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a detailed description and comparison between these models see File 
S3, File S5, Figure SI, Figure S2, and Figure S4). Under the infinite-site 
model, variation within a duplicate at equilibrium is negatively corre- 
lated with IGC rate and positively correlated with crossover rates. The 
explanation for the latter is that the higher the crossover rate, the 
higher the probability for two duplicate blocks that have already un- 
dergone IGC to become separated, thus allowing for another IGC event 
to effectively transfer different variants from one block to another. 

Figure 4 and Figure S5 show LD patterns (D' and r^, respectively) 
for different crossover models and IGC rates. Row 1 shows the case 
for R = 0 and provides a nuU expectation for the effect of IGC in the 
absence of crossover: high LD between duplicates (see Materials and 
Methods), represented by a diagonal dark blue line. Under the SCC 
model (row 2), we observe that in the presence of crossover, the 
pattern of LD in regions undergoing IGC changes considerably. LD 
within the single-copy block wiU be low due to recurrent crossover 
within it, relative to the higher levels of LD in the original and dupli- 
cated blocks where no crossover occurs. However, this only holds for 
low IGC rates, because moderate-to-high IGC between duplicates also 
breaks down LD within them. Furthermore, LD between duphcates 
will always increase with increasing IGC rate but will decrease with the 
rate of crossover. 

In the absence of crossover, moderate IGC decreases the amount 
of LD within duplicate blocks. This very weak effect is expected 



because IGC breaks linkage within blocks by transferring new vari- 
ants from the other duplicate. High IGC does not break this linkage 
since variants are always exchanged between duplicates on the same 
chromosome (Figure 4, row 1). However, when crossover is active, the 
effectiveness of IGC in breaking LD blocks increases dramatically and 
correlates positively with IGC rate and crossover rates (data not shown 
for different crossover rates) (Figure 4, row 2). 

WRC model: whole-region crossover 

Because it is likely that crossover is not restricted to the single-copy 
region between dupUcations, we explore the interplay between IGC 
and crossover by allowing crossover to extend over the whole simulated 
region. Allowing crossover to overlap with regions subject to IGC 
events decreases the expected within-block variation for all IGC and 
crossover rates (Figure 5). Points for the WRC model fall on curves 
corresponding to R' = {2/3)R. That is, given that we are simulating 
three blocks of equal length, allowing crossover to occur on the 
whole region has an effect on variation equivalent to decreasing 
the crossover rate by one third under the SCC model (for a detailed 
explanation, see File S6). This effect is observed for variation within 
blocks (Figure 5A), variation between blocks (Figure SB), and LD 
between duplicates (Figure 5C). 

Row 3 of Figure 4 shows the pattern of LD under the WRC modeL 
As expected, LD blocks present within the original and dupUcated 



■ Table 1 Comparison between parameter ranges explored through simulations by Innan 2003, Thornton 2007, and this work 







Innan 2003 


Thornton 2007 


This Work 


Parameters 


Notation 


Coalescent 


Coalescent 


Forward 


Crossover rate per meiosis 

Mutation rate per block per generation 

IGC rate per duplicated site per generation 

Mean IGC tract length 

Number of loci 

Effective population size 


R = 4Nr 
0 = 4N/^L 
C = 4Nc 
A 

"2 

N 


0, 1, 10 

10 
0.2, 1, 5 
1 mut. 
2 


0, 100, 10000 
10 
0, 1, 10 
10, 500, 900 
2, 5, 8 


0, 1, 10, 50, 100 
5 

0.001, 0.01, 0.1, 0.5, 1, 5, 10, 50 
100 
2 

1000 


IGC, interiocus gene conversion; mut., mutations per 


IGC event. 
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■ Table 2 Nomenclature used throughout this work to distinguish between different types of variation, and allowing for comparison 
between theoretical models by Ohta (1983) and Innan (2002, 2003) 



Type of Variation 




Ohta 1983 
Model A 


Innan 2002 
Model B 


Innan 2003 
Model C 


This Work 
Simulations 


Variation within blocl<s 
Variation between blocl<s on 
Variation between blocks on 


different chromosomes 
the same chromosome 


< = (1 - f)i 

< = 0- C2)L 

< = {1-c,)L 


ttS = E(h„)L 
< = E(ht,)L 


< = E(^b) 





blocks under the SCC model disappear when allowing crossover to 
occur within duplicate blocks. As the rate of IGC increases, LD within 
duplicate blocks decreases, while LD levels and patterns remain con- 
stant within the single- copy block, where no IGC is acting. In accor- 
dance with the reduction of the strength of crossover depicted in 
Figure 5, there is an increase of LD between duplicates imder the 
WRC model compared to the SCC model 

HSC model: hotspot crossover 

Finally, the HSC model considers that crossover occurs in short and 
specific segments of the sequence. We first explore the effect of the 
presence of a crossover hotspot in one of the copies (in this case, the 
original copy). We show that if the hotspot is located toward the left of 
the whole simidated region (this is, furthest away from the single-copy 
region in our model) the average variation within the whole original 
block is significantly reduced compared with that found under the 
SCC model (Figure 6). In contrast, if the hotspot is located adjacent to 
the single-copy region, variation is reduced to a much lesser extent. 



Simulations with centered hotspots reach intermediate levels of vari- 
ation. Minimum and maximum variation are achieved for R = 0 and 
under the SCC model, respectively. The average variation correspond- 
ing to any hotspot location falls between these two extreme values for 
a given IGC rate. 

To better understand the reasons behind this decrease in variation, 
we calculated nucleotide variation in 1-kb bins and analyzed results 
for several hotspot locations. Results are depicted in Figure 7. We 
observe that IGC between paralogous regions makes the variation 
pattern identical for original and duplicated blocks. We also observe 
that the HSC model affects variation levels on each bin depending on 
their position relative to the hotspot. Bins to the left of the hotspot 
have the same levels of variation as they have under the SCC model; 
bins to the right of the hotspot have the same level of variation as they 
have under nuU crossover; and bins within the hotspot have interme- 
diate values of variation between these two extremes. In the latter case, 
the fall in variation is gradual (from left to right) and the variation on 
each bin within the hotspot depends on the hotspot length and the 



C = 0 



C = 1 



C = 50 












Figure 4 LD patterns under different 
crossover models. Average values for 
1000 simulations are shown. LD be- 
tween each pair of windows along the 
sequence (D') is coded with a number 
between 0 and 1 and represented 
with a color (from white to dark blue 
to light green). Three different IGC 
rates (0, 1, and 50) are represented 
in columns and three different cross- 
over conditions are shown in rows: no 
crossover, SCC (R = 50), and WRC (R = 
50). The red lines below the names 
identifying each block show regions 
undergoing crossover. In the first 
row, where no crossover is acting, 
a dark blue diagonal line appears 
when IGC is active (and increasing 
with IGC rate) representing LD be- 
tween paralogous windows of dupli- 
cate blocks. LD within the duplicate 
region is high when no IGC is acting 
and when IGC is high (C = 50) but 
decreases with a medium IGC rate. 
LD between duplicates (dark blue di- 
agonal line) decreases when crossover 
is active on the single-copy region 
(SCC model, row 2) and on the whole 
region (WRC model, row 3) with re- 
spect to R = 0. As expected, crossover 
breaks LD blocks in the regions where 
it is acting. IGC also breaks LD blocks 
within duplicates if crossover is active 
(both in the SCC model and in the 
WRC model). 
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Figure 5 Comparison between data from simulations under the SCO model and the WRC model. Results are shown for (A) -rr^^, (B) ■n/^, and (C) 
E(Dsum)- Continuous lines are based on the SCC model and are shown for different crossover rates (R = 10, 50, and 100). As expected, results from 
sec simulations are in very good agreement with theoretical expectations. Results from WRC simulations agree with theoretical expectations 
(discontinuous lines) for R' = (2/3)R, showing that allowing crossover to occur in the duplicate regions has an effect identical to that of effectively 
reducing the crossover rate by one third on the SCC model. Although we have not implemented any IGC rate dependence on sequence similarity 
between duplicates, according to Walsh (1 987) IGC rates C > 0.2 would ensure the prevalence of stable concerted evolution at least temporarily 
in the face of genetic drift (see File SI). The shaded area indicates the region that lies beyond this threshold, where both theoretical predictions 
and results from our simulations might not be biologically realistic. 



position of the bin within the hotspot. These results show that hotspot 
location within regions undergoing constant IGC can affect levels of 
variation within duplicate regions. 

Given the evidence that crossover hotspot location is dependent on 
sequence motifs, at least in the case of humans (Myers et al. 2008), it 
would make sense for hotspots to be present in both duplicates. Since 
the effect of crossover is symmetrical with respect to the center of the 
simulated region while IGC acts between paralogous regions, the de- 
crease in variation within duplicate blocks observed under the HSC 
model with one hotspot is not maintained with two hotspots (Figure 
S6 and Figure S7). 

Figure 8 and Figure S8 show LD patterns (D' and r^, respectively) 
under the HSC model with one and two hotspots for different IGC 
rates. As expected, if C = 0, highly identifiable LD blocks form to the 
left and right of the hotspot or hotspots. Contrary to the SCC model, 



LD blocks extend from the original to the duplicated block. For C > 0 
and a single hotspot, a complex pattern appears driven by the break- 
down of LD blocks within duplicates. As we have observed for the 
SCC model and the WRC model, LD blocks within duplicates break 
down strongly with IGC only if crossover is active between them. 
Under the HSC model with a single hotspot, LD breaks down to 
the left of the hotspot and in the corresponding paralogous region 
in the duplicated block; this breakdown is unnoticeable, although 
present, to the right of the hotspot for C = 1. LD between duplicates 
is greater to the right of the hotspot location than to the left of the 
hotspot location since crossover is acting between paralogous regions 
only to the left of the hotspot. 

Under the HSC model with two hotspots, crossover generates 
a symmetrical pattern of LD blocks in both duphcates. As a conse- 
quence, when IGC is active between paralogous regions, it breaks 




Figure 6 Variation within a duplicate block under differ- 
ent crossover conditions. The plot shows variation within 
duplicates under the HSC model. Each point corresponds 
to the average equilibrium value over 1000 simulation 
runs. Columns indicate different hotspot locations on the 
original block (illustrated by the red lines in the diagram). 
Expected values [■nj-) for R = 0 and for the SCC model 
(R = 10) are shown to the left and to the right of the plot 
respectively. Regardless of the width and localization of 
the hotspot, variation within duplicates is decreased in 
comparison with the SCC model but increased with re- 
spect to R = 0. Hotspots located the furthest away from 
the duplicated block (to the left) have the strongest effect 
in lowering the amount of variation, while those localized 
closest to the duplicated block still lower the variation but 
to a lesser degree (to the right). Hotspots centered in the 
original block have an intermediate effect. 
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Figure 7 Distribution of variation along the sequence under different 
crossover conditions. Comparison between variation along the se- 
quence when R = 0 (top row) and when crossover occurs (R = 10) on 
different conditions. Red rectangles indicate the regions undergoing 
crossover. In the bottom, variation along the sequence on SCC model 
is shown. Plots in the middle show different HSC model cases (differ- 
ent locations and lengths of the crossover hotspot region). Circles 
correspond to average pairwise differences calculated by bins within 
the original, single-copy and duplicated blocks, respectively. Each 
block is divided into five bins. Bins to the left of the hotspot have an 
amount of variation similar to that found under the SCC model 
whereas those to the right have a variation level equivalent to that 
of a model with no crossover (R = 0). Bins within the hotspot have 
intermediate levels of variation, which are lower for bins that are closer 
to the single-copy region. Additionally, original and duplicated blocks 
have identical (non-symmetrical) patterns of variation within them. 
This figure is for C = 0.5. Equivalent results are attained for greater 
values of C. 



down LD not only in the external regions with respect to the hotspots 
but also in their corresponding paralogous (internal) regions, resulting 
in a complete breakdown of the LD along the duplicated blocks. 

Overview 

We have described the effects of IGC under different crossover models 
and highlighted the importance of the distribution of crossover 
junctions in the attainment of variation within and between duplicates. 
As we have shown, crossover between paralogous regions provides the 
opportunity for IGC to effectively transfer new variants between 
paralogous regions and for IGC to break LD within duplicates more 



efficiently. If crossover is not acting between copies, variation will be 
lower and LD will be higher within paralogous regions for the same 
IGC rates. In the case of the HSC model, we can find both situations in 
the same duplicated region, creating a complex pattern of nucleotide 
variation and LD along the duplicates. We have summarized the 
different possible scenarios in Table 3. 

DISCUSSION 

The complexity behind the interplay between IGC and crossover was 
already evident from theoretical models of the effect IGC and 
crossover on levels of variation within and between duplicates under 
neutrality (Ohta 1982, 1983; Nagylaki 1984; Innan 2002, 2003). How- 
ever, extending its application not only to duplicated genes but also to 
SDs, which can span large areas of the genome and are not necessarily 
in tandem, demands the incorporation of more realistic crossover 
models. To tackle SD evolution, forward-time simulators promise to 
be an efficient tool for intense exploration of broad ranges of param- 
eter values. Here, we have presented a first glimpse at what forward- 
time simulations can offer, limiting our analysis to neutral evolution 
and thus, providing a null-model for future scans of areas under se- 
lective pressure within SDs. 

Increased variation within duplicates undergoing IGC is an ac- 
cepted phenomenon (Baltimore 1981; Ohta 1982, 1983; Nagylaki 
1984; Walsh 1987; Innan 2002, 2003; Thornton 2007; Ohta 2010; 
Teshima and Innan 2012) and it has been observed in humans (Bosch 
et al. 2004; Hallast et al. 2005) as well as in other species (Nielsen et al. 
2003; Rane et al. 2010; Willett 2013). In this study, we have shown 
that the extent to which variation within duplicates is increased due to 
IGC is highly dependent upon the distribution of crossover junctions. 
In the case of crossover hotspots, if they happen to fall inside dupli- 
cations, the distribution of variation within these regions might be 
altered, with a strong dependence on the hotspot location. This com- 
plex pattern of variation might affect the fate of SDs and could be 
important to take into account when calculating IGC rates. The impli- 
cations of increased variation on the evolutionary fate of SDs are 
largely unexplored. For instance, high variation might increase the 
possibihty of fixation of compensatory mutations (Plotnikova et al. 
2007) and afford more opportunities for natural selection to act upon 
standing variation (Katju et al. 2008). This wUl not be independent of 
the underlying recombination scenario under which certain levels of 
variability are achieved. 

Abundant IGC between duplicates may facilitate parallel selective 
sweeps if increased gene dosage is positively selected (Hanikenne et al. 
2013). IGC is additionally thought to be one of the mechanisms re- 
sponsible for the elimination of deleterious mutations in polyploid 
asexual genetic systems (Khakhlova and Bock 2006) and in the human 
Y chromosome (Rozen et al. 2003; Marais et al. 2010), allowing them 
to escape MuUer's ratchet. Alternatively, deleterious effects of IGC also 
have been reported; Casola et al. (2012) demonstrated that the intro- 
duction of deleterious alleles via IGC has happened in at least 1% of 
human genes and identified thousands of potentially deleterious muta- 
tions that could be disease-causing if they were to have the same fate. 

Another interesting result concerns LD. We have observed that the 
relationship between LD and IGC rate is nonlinear because both very 
low and very high IGC rates will produce high LD within duplicates, 
whereas intermediate IGC rates produce intermediate levels of LD. 
This effect cannot be observed if crossover is active between duplicates 
because the interplay between IGC and crossover drastically increases 
the breakdown of LD within duplicates. We have also analyzed the levels 
of LD between duplicates and foimd that they have a positive correlation 
with IGC rates and a negative correlation with crossover rates. 
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Figure 8 Comparison of LD between 
scenarios with one or two crossover 
hotspots. Average values for 1000 
simulations are shown. LD between 
each pair of windows along the se- 
quence (D') is coded with a number 
between 0 and 1 and represented with 
a color (from white to dark blue to light 
green). Three different IGC rates (0, 1, 
and 50) are represented in columns. 
Rows show the effect of crossover 
(R = 50) located in one specific region 
of the original block or in both the 
original and duplicated blocks (in 
paralogous regions). The red line be- 
low the names identifying each block 
shows regions undergoing crossover. 
In the first column, the effect of cross- 
over delimiting LD blocks is clear. 
When IGC is active (in the second 
and third column), a complex pattern 
of LD appears along the sequence to- 
gether with the dark blue diagonal line 

representing LD between duplicate regions (stronger when no crossover is acting between paralogous windows of the duplicate blocks). With the 
presence of one hotspot on the original block, there are paralogous windows to the right of each duplicate between which there is no crossover 
and, thus, IGC has lower power to break LD in these fragments. In the case of two hotspots, this situation disappears and the combination of 
crossover and IGC breaks LD within duplicate blocks. 






Gene conversion is already known to affect patterns of LD and is 
considered to be the main cause of LD breakdown over short distances 
(Andolfatto and Nordborg 1998; Ardlie et al. 2001; Frisse et al. 2001; 
Plagnol et al. 2006). However, this observation is restricted to gene 
conversion within the same loci on different chromosomes, referred to 
as intralocus gene conversion, interallelic gene conversion, or simply 
gene conversion. It has been suggested that similar molecular mech- 
anisms might be behind both intralocus and ICG (Jeffreys and May 
2004). However, even if a similar mechanism could be illustrated 
(Hastings 2010), it does not imply that their effects on variation or 
on patterns of LD should be equal. In this work, we have exclusively 
dealt with gene conversion between different loci (IGC) and have 
analyzed, through simulations, the effect of IGC on LD patterns (D' 
and r^) not only within but also between duphcates. Patterns of LD 
inside gene families have received some attention (Hallast et al. 2005), 
but the causes for the appearance and maintenance of these patterns 
have not been explored thoroughly. We have demonstrated that in the 
case of IGC, the interplay with crossover is crucial to determine the 
levels of LD. We stressed the importance of differentiating between 



LD within and LD between duplicates because they do not respond in 
the same way to the action of IGC and crossover and might prove to 
be useful in ascertaining biases such as donor-acceptor bias. 

Under neutrality, there are at least two possible mechanisms by 
means of which different levels of variation and LD can be attained 
along a duphcated sequence: nonhomogeneous distribution of IGC 
events and nonhomogeneous distribution of crossover junctions. IGC 
rates are known to be dependent on sequence similarity between 
paralogous regions. We suspect that setting minimum identity 
thresholds for IGC to occur (see File SI) may cause patterning inside 
SDs. Our simulator incorporates the possibility of setting these thresh- 
olds but the effects of these are beyond the scope of this paper. 
Patterning could also be caused by the nonhomogeneous distribution 
of crossover junctions that we have explored here. AH crossover mod- 
els share a common characteristic: given a homogeneous IGC rate, 
crossover is effective in increasing variation and in breaking LD within 
duplicates only if it is acting between paralogous regions. Of course, 
crossover does not play an "active" role in these effects; it only sets the 
ground for IGC to cause them. 



Table 3 Summary of variation and LD measures under different IGC and crossover conditions 







IGC Rate 




Crossover Between Paralogous Regions 




C~ 1 


C~ 100 


R« 0 

sec (R ~ 50) 


~ % (min) 
TTb ~ 20Td (max) 
TTs « 20Td (max) 
LD„ ^ high 
LDb ~ high 
Tr„ « 0 (min) 
TTb ~ 20 Td (max) 
■775 « 2©Td (max) 
LD„ w high 
LDb ~ 0 (min) 


Ti„ w intermediate 
TTb = intermediate 
-iTs = low 

LD„ = intermediate 
LDb = high 
TT^ « 20 (max) 
TTb = intermediate 
TTj = intermediate 
LD„ = intermediate 
LDb = intermediate 


IT™ 0 (min) 
TTb ~ 0 (min) 
TTs 0 (min) 
LD„ ^ high 
LDb ~ 1 (max) 
TT„ 0 (min) 
TTb ~ 0 (min) 
TTs 0 (min) 
LD„ low 
LDb ^ high 


Td is time in generations since duplication; LD, linkage disequ 


llbrium; IGC, interlocus gene 


conversion; SCC, single-copy crossover. 
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The presence of crossover hotspots within SDs is of special re- 
levance given the asymmetrical patterns it can generate within them. If 
a crossover hotspot is located inside a copy of the duplication but not 
in its paralog, increased variation might only be found in the region 
that is separated from its paralog by the hotspot and not in the rest of 
the SD. This asymmetry might tend to disappear via IGC (if it were 
caused by a sequence motif), either by the elimination of the hotspot 
or its transfer to the duplicate. For a short period of time, however, 
this asymmetry might by decisive in determining the evolutionary fate 
of genes located in different regions within the SDs. 

The analysis that we presented here covers a broad range of IGC 
rates that spans the observed ranges in a wide variety of species. The 
applicability of our results to real data depends on the extent to which 
the measurement of IGC rates between dupUcates is accurate. As 
evidenced by Mansai and Innan (2010) some methods to detect IGC 
rates, such as GENECONV (Sawyer 1989), can underestimate IGC 
rates if they are very high. New methods are still being developed (e.g., 
Dumont and Eichler 2013). We believe that for tandem duplications, 
the LD measurement method that we have implemented here would 
be able to detect rates of IGC between duplicates, especially if they are 
high. A formal comparison between our method and others would be 
an interesting focus of further research. 

Given the relevance of IGC between SDs in primate and human 
recent evolution (Fawcett and Innan 2013) and their likely role in 
adaptation (Bailey and Eichler 2006; Han et al. 2009) we have pro- 
vided a testing ground by analyzing the neutral scenario. Exploring 
the effects of natural selection on the evolution of duplicated regions 
of the genome would be a natural next step and an important and 
surely interesting endeavor. 
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